DOCUMENT RESUME

ED 126 155                                        TM 005 412

TITLE         The Paradox of Criterion-Referenced Measurement.
PUB DATE      [Apr 76]
NOTE          25p.; Paper presented at the Annual Meeting of the
              National Council on Measurement in Education (San
              Francisco, California, April, 1976)

EDRS PRICE    HC-$1.67 Plus Postage.
DESCRIPTORS   Academic Standards; Achievement Tests; Comparative
              Analysis; *Criterion Referenced Tests; Decision
              Making; Item Analysis; Item Sampling; *Norm
              Referenced Tests; Standard Error of Measurement; Test
              Construction; Test Interpretation; Test Reliability;
              Test Validity
IDENTIFIERS   *Domain Referenced Tests; Test Theory; Variance
              (Statistical)



The existence of criterion-referenced (CR) measurement is questioned
in this paper. Despite beliefs that differences exist between two
alternative forms of measurement, CR and norm referenced (NR), an
analysis of philosophical and psychological descriptions of
measurement, as well as a growing number of empirical studies, reveals
that the common distinctions drawn between CR and NR measurement focus
on what occurs prior to and following measurement, namely the writing
of items and the interpreting of test scores. In this respect, the use
of the term "criterion-referenced measurement" is paradoxical. (Author)






The Paradox of Criterion-Referenced Measurement




Tom Haladyna
Teaching Research Division
Oregon State System of Higher Education






A paper presented at the annual meeting of the National Council
on Measurement in Education, San Francisco, 1976






Traditionally, there has been one measurement construct, most commonly
referred to as "norm-referenced". The need for more effective measures of
classroom achievement, coupled with the interest in using instructional
objectives in teaching and testing, has led to the establishment of a new
construct known as "criterion-referenced measurement". Norm-referenced (NR)
measures are believed to be compatible with the study of individual
differences, while criterion-referenced (CR) measures are more suitable for
ascertaining any examinee's level of performance with respect to a
well-defined achievement domain. The classical theory of measurement has
been found to be unsuitable for application to CR measures (Popham and
Husek, 1969), and a CR theory of measurement has been sought. Paradoxically,
an analysis of the distinctions commonly drawn between CR and NR
measurement, coupled with accumulating test data, suggests that there is
only one measurement construct with two functions, NR and CR. This paper is
devoted to an examination of this paradox.

Definitions 

An inevitable problem in any analysis of CR and NR measurement is how
these terms are defined and used in everyday test practices. The
elusiveness of many definitions of CR measurement as well as the wide range
of their applications has been reviewed by Hambleton and Novick (1975). The
CR test, according to Popham and Husek (1969, p. 2) is "...one which is used
to ascertain an individual's status with respect to some criterion."
Explicit in their conceptualization of a CR measure is that these tests are
objectives-based and that a criterion level is employed to assign examinees
to passing or failing categories. CR tests are believed to measure
achievement of a well-specified domain; the domain consists of a
well-defined set of tasks. There is no concern about how any examinee scores
with respect to other examinees, as attention in testing is focused on how
that examinee stands with respect to the entire domain of items. The
variance of CR test scores is said to be restricted to the degree that
traditional item and test statistics are useless.

In contrast, the NR test is primarily designed to measure the relative
standing of examinees. It is intentionally constructed to maximize
differences among examinees, consonant with the effort to measure individual
differences. Traditional methods for estimating item discrimination and
test reliability depend upon sufficient variance of the scores of examinees.
When scores are restricted, these estimates are attenuated.

A recent outgrowth of the CR test movement has been the domain-referenced
test (DRT). Any DRT is simply a random sample of items from a well-defined
domain of items. This deceptively simple definition, however, does not
capture the essence of a DRT. Millman (1974) states that the DRT is
conceptually quite distinctive from CR and NR tests, both of which are
considered differential assessment devices as contrasted with the DRT, which
is a true CR measure. The DRT is primarily distinguished from the more
traditional CR test in terms of how items are created and how tests are
constructed. That is, item-generating algorithms are used to write test
items, and items are randomly sampled to test forms. Following these
procedures will result in measures which have no reference to the sample of
examinees, but yield clear-cut measures of achievement within that domain
for each examinee.

In the balance of this analysis, both CR and DR tests will be treated
as different cases of CR measurement. Distinctions drawn between NR tests
and these two cases have been made in the areas of (a) what is measured,
(b) how domains are measured, (c) test standards, (d) item selection, (e)
reliability, measurement error, and decisionmaking, and (f) validity.




The core of the difference between CR and NR measures is said to be
what is intrinsically measured. In the context of classroom achievement
testing, the objective of a NR measure is to ascertain the rank and
relative differences for a group of examinees with respect to an achievement
domain; the objective of a CR test is to estimate the level of functioning
of an examinee so that level can be compared to the level of acceptable
performance. Carver (1974) describes this difference as "the measurement
of individual differences versus the measurement of the amount learned"
(p. 512). This distinction has been made in a number of other discussions
of CR tests (e.g., Popham and Husek, 1969; Anderson, 1972; Hambleton and
Novick, 1975). In fact, this distinction pervades almost the entire body
of literature on the subject.

Measurement has traditionally meant the obtaining of a numerical
description of an object on a trait through the use of some rules. In
achievement testing, this has come to mean the sum of correct responses
on a test for a student. The object is the student, the trait is
achievement, and the rules involve taking the sum of correct responses. One
kind of test (NR) gives relative information, and the other kind (CR) gives
absolute information. Measurement has been viewed in other contexts as
having two functions: (a) knowing the quantity thereof, and (b) making fine
and subtle discriminations (Kaplan, 1965), and similar observations have
been made by Cronbach, Ebel (1975), and Messick (1975). Thus on the
surface, the core of the difference between CR and NR measures has been
widely discussed as one of what is inherently measured. In actuality, it
appears that any achievement measure can yield one of two interpretations
depending upon the function we wish to employ for the purposes at hand.
This distinction is illustrated by noting the difference between a percent
and a percentile (Glaser and Klaus, 1962). Using the above illustration, it
should be apparent that any test can yield percentile or percentage
interpretations, and that each provides unique information despite the
single measurement that occurs.
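
The percent and percentile readings of one raw score can be made concrete
with a small sketch. The scores below are hypothetical, and the tie-handling
rule for the percentile rank (counting half of the tied scores) is one common
convention assumed here, not something specified in the paper.

```python
def percent_correct(raw_score, n_items):
    """Absolute (CR-style) reading: share of the items answered correctly."""
    return 100.0 * raw_score / n_items


def percentile_rank(raw_score, group_scores):
    """Relative (NR-style) reading: percent of the group scoring below,
    counting half of any tied scores (an assumed convention)."""
    below = sum(1 for s in group_scores if s < raw_score)
    ties = sum(1 for s in group_scores if s == raw_score)
    return 100.0 * (below + 0.5 * ties) / len(group_scores)


# Hypothetical raw scores on a 20-item test for a class of ten students.
scores = [8, 10, 11, 12, 12, 13, 14, 15, 16, 18]
student = 14

pct = percent_correct(student, 20)       # 70.0: how much of the test
rank = percentile_rank(student, scores)  # 65.0: standing within the group
```

The same raw score of 14 supports both statements. Only the reference used
for interpretation changes, not the measurement itself.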

How Achievement Domains Are Measured

The way items are created is another way to distinguish the CR, NR, and
DR tests from each other. In a purely CR approach, items are constructed to
directly represent instructional objectives. The information obtained from
a CR test constructed in this manner permits inferences to be drawn about
the degree to which an objective or set of objectives has been achieved.
This is contrasted with a traditional approach where (a) the achievement
domain is abstractly defined, (b) items are written to represent the
construct, and (c) test results are used to confirm our predictions about
the construct. These procedures may be recognizable as those recommended
for the establishment of the construct validity of tests. Nonetheless, the
procedures describe what goes on ideally in the creation of achievement
tests using the classical theory. And it should be clear that test-making
is often reduced to one of introspective and subjective item writing to
represent some vaguely-conceived domain. In the DR approach, items are
created through the employment of an item-writing algorithm such as the ones
suggested by Bormuth (1970) or Hively (197-). Data have not yet been
collected and reported that attest to the distinctiveness of any of these
item-generating approaches as leading to unique measures of the same
achievement domain in question.

It has been noted that the DR test is formed by random sampling of items
from an item pool to test forms. Despite the recency of DR testing, it needs
to be noted that the practice of random sampling is neither unfamiliar nor
antithetical to traditional test theory. The term "domain sampling" was used
by Nunnally (1967) to describe the classical theory of measurement. In the
truest sense, the classical theory has involved the random sampling of items
from a well-defined set (Lord and Novick, 1968, p. 29), and the need for
random sampling is explicit in the theory of generalizability as well
(Cronbach, Gleser, Nanda, & Rajaratnam, 1972).

The essence of the difference between CR and NR testing lies in the
interpretation of the achievement domain. In the DR approach, the domain is
operationally defined, while in traditional testing, the objective is one of
inferring an abstract construct. While the use of item-generating techniques
and random sampling may not distinguish a DR test from others, the
interpretation of test results can be clearly CR or construct-referenced. In
a DR approach, any test score represents an unbiased estimate of performance
on all items in that domain. Thus, the interpretation is based upon our
operational definition of that domain. In a traditional approach, one
infers a construct which is more or less intangible. The acceptance of DR
testing hinges on our willingness to accept an operational definition as
a best indicator of a domain. For example, a domain of word problems in
mathematics may be created through the use of an item-writing algorithm.
Is it sufficient to know how many of all problems any student can correctly
solve, or do we wish to know about something more intangible and seemingly
remote, such as arithmetic reasoning? The issue presented here is more in
the realm of values and meaning, and interestingly enough, this issue
involves something that occurs after the process of measurement.
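
The claim that a randomly sampled form gives an unbiased estimate of domain
performance can be checked with a short simulation. This is only a sketch
under stated assumptions: the domain size (200 items), the student's mastery
of 140 of them, the form length (20), and the number of simulated forms are
all hypothetical values chosen for illustration.

```python
import random


def form_score(domain_mastery, form_size, rng):
    """Proportion correct on one test form drawn at random from the domain."""
    form = rng.sample(domain_mastery, form_size)
    return sum(form) / form_size


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical domain of 200 items; 1 marks an item this student can solve.
domain_mastery = [1] * 140 + [0] * 60
domain_score = sum(domain_mastery) / len(domain_mastery)  # 0.70

# Scores on many 20-item forms, each sampled at random from the domain.
form_scores = [form_score(domain_mastery, 20, rng) for _ in range(2000)]
mean_form_score = sum(form_scores) / len(form_scores)
```

Individual forms scatter around the domain score, but their average settles
close to it. The simulation says nothing about whether the operationally
defined domain is the right thing to interpret, which is the question the
paragraph above raises.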

One study was recently completed which bears importantly on the issue
of how domains are measured. Roid and Haladyna (Note 1) contrasted two
types of item-writing procedures, one purely DR, the other CR. Bormuth's
item-writing rules were used to construct a subtest over a 32-page,
learner-verified programmed text, while another subtest was prepared using
instructional objectives for the same material. The tests were administered
prior to and following instruction to a group of students. One item writer
consistently produced items of greater difficulty than the other item
writer, and both item writers produced roughly the same number of faulty
items regardless of the item-writing approach. In fact, both DR and CR
item-writing approaches unexpectedly produced the same large number of
faulty items that one could expect from using the traditional, subjective
approach to item writing. A subsequent experiment is currently in progress
where a NR approach is being compared to the CR and DR approaches. In light
of the present stage of development of item generation theories and the
rather negative supporting evidence, it appears premature to conclude that
CR and DR tests offer distinctive and superior measures of achievement.

Standards 

In DR and CR tests, a standard is typically used to assign students to
a passing or failing category. Establishing a point on any test scale is
an action which is independent of the measurement process, and something
that can occur for any test. Doing so does not make a test CR or DR. It
is merely using the test score to determine the worth of a student's
learning effort or the value of the instructional program. In fact, it
should be considered a CR use of a test score.

There is a more subtle and important problem with standards and the
questionable existence of CR tests. Returning to the study by Roid and
Haladyna (Note 1), where two types of CR test item-generating techniques
were contrasted, clear-cut differences in the difficulties of items
written by the two item writers led to the creation of two scales, one hard
and one easy, which were both reliable measures of the same achievement
domain. Administering these forms to a group of students following
instruction would create some serious problems in assigning students to pass
or fail categories. Students receiving the hard test would more often be
falsely categorized as failing. The use of either item-writing algorithm
failed to reduce this difference between the item writers. The implications
of this state of affairs are perplexing in light of the fact that CR
measures are reputed to produce unbiased measures of student achievement. If
either type of CR test is to be distinctive from the typical NR test, the
quality of items should be uniformly high and the differences in difficulty
between various item writers should be substantially reduced. This has not
occurred, and a standards problem continues to exist.

Test Variance 

There has been considerable debate over the role and extent of test
score variance in CR measures (Woodson, 1974a, 1974b; Haladyna, 1974;
Millman and Popham, 1974). Popham and Husek (1969) maintained that the
meaning of scores flows from the item-objective congruence and not the
notion of individual differences, of which variability is a related concept.
The variation of CR test scores is said to be restricted when learning and
instruction are effective, and this seriously impedes the use of traditional
item and test statistics.

Woodson (1974a, 1974b) has supported the idea that variability of test
scores is a function of the sample of examinees. This has been clearly
demonstrated in at least one study (Haladyna, 1974) and is logically realized
when one considers the situation where one group of students has not learned
content from a domain and another group has learned the content quite well.
Resulting achievement test scores should be bimodal, with the concentration
of scores at the top and bottom of the achievement scale. The variability
of these test scores is quite large. When the high group averages 100 percent
and the low group averages 0 percent, it can be deduced that the variance of
test scores for the two groups (when equally matched for sample size) is as
high as possible. Therefore, any CR test could have substantial variance.
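
The extreme case described above can be verified numerically. For scores
bounded between 0 and 1 (proportion correct), the largest possible variance
is one quarter of the squared range, that is 0.25, and it is reached when
half of the scores sit at each end. The group sizes below are hypothetical.

```python
from statistics import pvariance

# Hypothetical proportion-correct scores: ten uninstructed students at 0
# and ten instructed students at 1 (equal group sizes, as in the text).
pre_group = [0.0] * 10
post_group = [1.0] * 10

var_post_only = pvariance(post_group)            # 0.0: restricted sample
var_combined = pvariance(pre_group + post_group)  # 0.25: the upper bound
```

The same test therefore shows no variance at all in the post-instruction
group alone and the maximum possible variance in the combined sample, which
is the sense in which variability depends on who is sampled.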

The usefulness of variability is questioned in CR tests by Millman and
Popham (1974) in a rebuttal to Woodson's initial comments. The argument
follows from the basic distinction made earlier, that CR tests are concerned
with how much a student has achieved rather than how different students are.
The primary speculation about test variance appears to be what the role of
variance in such tests should be. Perhaps the point of contention with
variability involves the notion that a CR test is appropriate for
determining how much a student has learned while a NR test is more
appropriate for measuring individual differences. The measurement of
individual differences, as earlier noted, is a function of all interval or
ratio measurement scales. It is clear that CR and DR tests have substantial
variability when one samples high and low achievers. The very same is true
for a NR test. It may be more appropriate to say that any test is open to
interpretations about individual differences if that test is sensitive to
the trait being measured. If the test isn't sensitive, it probably contains
too much measurement error to be useful for anything.

Given that variance can be substantial in any classroom achievement
test when the sample obtained spans the full range of achievement, how do
traditional item and test statistics work with these CR and DR tests? It
is with item and test statistics that CR and NR tests should distinguish
themselves.

Item Selection 

There is a difference between the role of item analysis in CR and DR
testing. Traditionally, CR item analysis has been one of ascertaining item
quality in light of instructional sensitivity. The most commonly used CR
item discrimination index is one derived by taking the difference between
the difficulties of the item when administered to pre- and post-instruction
examinees (Cox and Vargas, Note 2). A number of empirical studies have been
done comparing various indexes with the Cox-Vargas coefficient (e.g.,
Rahmlow, Matthews, and Jung, Note 3; Popham, 1972; Tsu, Note 4; Haladyna,
1974; Helmstadter, Note 5; and Crehan, 1975). The scope and limitations of
most of these studies were recently discussed in a study by Haladyna and
Roid (Note 6).
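
A minimal sketch of the pre-to-post difference index described above follows.
The response data and item names are hypothetical, and the function is a
plain reading of "difference between item difficulties", not a reproduction
of the computations in the cited studies.

```python
def item_difficulty(responses):
    """Proportion of examinees answering the item correctly (the p-value)."""
    return sum(responses) / len(responses)


def pre_post_difference(pre_responses, post_responses):
    """Post-instruction difficulty minus pre-instruction difficulty."""
    return item_difficulty(post_responses) - item_difficulty(pre_responses)


# Hypothetical 0/1 responses of ten examinees tested before and after
# instruction on two items.
pre = {
    "item_1": [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    "item_2": [1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
}
post = {
    "item_1": [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    "item_2": [1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
}

# item_1 rises from 0.2 to 0.9 correct; item_2 stays at 0.9 both times.
index = {name: pre_post_difference(pre[name], post[name]) for name in pre}
```

An item with a large positive difference is called sensitive to instruction.
An item that is already easy before instruction shows a difference near zero
regardless of its quality.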






In DR testing, empirical item analysis should not be used due to the
possibility that the interpretation of DR test scores is destroyed when
items generated for the DR test pool are tampered with. Millman (1974,
p. 339) states:

The use of item statistics destroys the random selection process, a defining 
characteristic of DRT's. Unless items are selected randomly, the estimate 
of a person's level of functioning loses meaning and interpretability of the 
test score is reduced. 

A further criticism of the use of instructional sensitivity was offered
by Cronbach (1975), who stated that useful items would not be sensitive to
instruction and would thus be falsely discarded. Such items might be
transfer items, which should display a high difficulty index both preceding
and following instruction.

There are a host of compelling reasons for the use of item analysis,
and these reasons need to be established before the evidence for the
distinctiveness of various CR item selection procedures can be discussed.
First, the item-writing procedures which distinguish DR tests from all
others have not been shown to produce uniformly high quality items. The
study by Roid and Haladyna (Note 1) indicated that a DR procedure produced
as many faulty items as a CR procedure. In fact, the number of faulty items
produced was comparable to the number of faulty items one would expect from
using the traditionally subjective approach. Second, how doing an empirical
item analysis destroys random selection is unclear. It seems reasonable to
employ an empirical screening method to weed out faulty items, as they
surely seem to exist in any item pool. Random sampling can then be done from
an item pool which has been cleared of faulty items. Third, how empirical
item analysis destroys interpretability is also unclear. Defective items
are one source of measurement error. To rid item pools of measurement error
can only improve the precision of test scores as well as interpretability.
Finally, alternative and non-empirical procedures for item analysis have
been proposed by their advocates (e.g., Hambleton, et al., Note 7). These
procedures involve committees of content experts who judge items for their
appropriateness. These non-empirical procedures are aligned with the
logical analysis that precedes item pools and tests. There is no available
evidence of the soundness of these non-empirical procedures on DR or CR
items, and there is reason to believe that these procedures are nothing more
than traditional approaches to establishing content validity.

Perhaps a more compelling reason for empirical item analysis is a
consideration of the inferences one draws from test data and its basic
unit of measure, the item. This reason is also at the core of the supposed
difference between CR and DR measures. Thorndike (in Jackson & Messick,
1967, p. 205) stated that "Each item is in a very real sense a little test
all by itself. Each item must necessarily be judged on its own merits as
far as validity is concerned." Traditionally, item validity has been the
correlation between item and test performance for a group of examinees,
the item discrimination index. If a CR test, as Popham and Husek (1969)
describe it, is sensitive to treatments (i.e., instruction), we expect
pre-instructed students to score low and post-instructed students to score
high. Messick (1975, p. 959) has conveyed a similar impression when he
states:

the most sensitive and soundest evidence is likely to come from experimental
studies of groups receiving different instructional treatments or of test
administrations under different conditions of motivation and strategy.




The instructional sensitivity might apply not only to test scores, but
to item difficulties as well. Since the item is the analogue of the test,
empirical item analysis appears to be a necessary feature of any achievement
test regardless of its purpose, CR or DR; and the Cox-Vargas index
comes conceptually closest to measuring CR item discrimination. Is this
index actually distinctive from others?

The discussion up to this point in this section has been directed at
the necessity of item analysis in DR testing. In CR testing, it has been
advocated for some time (Kosecoff and Klein, Note 8; Harris, Note 9).
However, the traditional point biserial correlation which serves as an
item discrimination index has been criticized due to the variance problem.
Whenever a correlation is computed from a restricted range of scores, that
estimate of relationship is attenuated. Haladyna (1974) used samples of
pre- and post-instructed students to compute traditional item statistics,
which compared very favorably to CR item statistics. Haladyna and Roid
(Note 6) examined a host of CR and other item statistics with a CR test and
discovered that all were uniformly highly related. These included a
Bayesian index (Helmstadter, Note 5), a Rasch instructional sensitivity
index, the Cox-Vargas coefficient, and the full-scale traditional item
discrimination coefficient (Haladyna, 1974). The intercorrelations among
these statistics approached unity, 1.00. Thus, it would seem that all four
statistics provide essentially identical information.
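
The attenuation argument can be illustrated with a small, hypothetical
response matrix. The sketch computes the traditional item-total
(point-biserial) correlation for one item, first on the post-instruction
group alone and then on the combined pre- and post-instruction sample. The
data are invented for illustration, and the item is left in the total score
(an uncorrected item-total correlation), which is an assumption of this
sketch.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; with a 0/1 item this is the point-biserial."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined without variance")
    return cov / sqrt(vx * vy)


# Hypothetical 0/1 responses to four items (rows are examinees).
pre_group = [[0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0],
             [0, 0, 0, 1], [1, 0, 1, 0]]
post_group = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1],
              [0, 1, 1, 1], [1, 1, 1, 1]]


def item_total_r(rows, item):
    """Correlation between one item and the total score (item included)."""
    return pearson([r[item] for r in rows], [sum(r) for r in rows])


r_post_only = item_total_r(post_group, 0)
r_combined = item_total_r(pre_group + post_group, 0)
```

With these data the coefficient for the first item is noticeably larger in
the combined sample than in the restricted post-instruction sample, which is
the direction the text describes.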

The contention that CR and DR tests are unique is not supported when
examining the rationale for item analysis and the empirical evidence for
item discrimination indexes. Perhaps the reason for this lack of support
can be found by closely examining a distinction drawn by Millman (1974)
when he labeled typical CR and NR tests as "differential assessment devices"
(DAD). In contrast, the DRT is not a DAD. Whenever group or individual
differences are considered, the concepts of variability and traditional
item and test statistics are quite acceptable. And it follows that the CR
as well as the NR test are DADs. However, it should be clear that these
concepts apply to the DR test as well; and when traditional statistics
are employed, the DRT resembles all others. Thus the CR and DR tests,
as defined in the paper, produce test results that lend themselves to
conventional item analysis. In the area of item selection, the distinctions
drawn among CR, DR, and NR tests don't withstand empirical tests.

Measurement Error, Reliability, and Decisionmaking

With any test, a certain degree of measurement error will occur.
Typically the construct of reliability assists us in understanding how much
measurement error can be found in the test responses of a group of
examinees. The problem in instruction is knowing how much error exists when
deciding who has passed and who has failed. Invariably, a number of
examinees' scores will fall near the passing standard, and the risk of
misclassifying these persons is high. There have been several suggestions
to establish confidence bands around the passing standard and assign
passing, failing, or conditional status to examinees based on this
confidence band (Hambleton and Novick, 1975; Millman, 1974). The procedures
minimize the misclassification errors that often occur.

What is required is a statistical procedure which permits the valid
estimation of a standard error from which the confidence band can be
constructed. The believed low variability of CR and DR test scores has led
some persons to reject classical reliability estimates (e.g., Popham and
Husek, 1969). However, reliability was shown to be reasonably estimated
from combined samples of pre- and post-instruction examinees in one study
by Haladyna (Note 10). Although reliability estimates are dependent upon
variance, the estimation of measurement error in traditional test theory is
not a function of the variability of test scores. Therefore, the
traditional standard error of measurement can be usefully estimated for any
test including putative DR and CR tests.
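
A sketch of how a traditional standard error might be turned into a
confidence band around a passing standard is given below. It uses the
familiar KR-20 reliability formula and the classical standard error of
measurement (score standard deviation times the square root of one minus
reliability). The response data, the passing standard of 2.5 items, and the
band width of one standard error are hypothetical choices, not values from
the studies cited.

```python
from math import sqrt

# Hypothetical 0/1 responses to four items from a combined sample of
# pre- and post-instruction examinees (rows are examinees).
rows = [[0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1],
        [1, 0, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1],
        [0, 1, 1, 1], [1, 1, 1, 1]]

n, k = len(rows), len(rows[0])
totals = [sum(r) for r in rows]
mean_total = sum(totals) / n
var_total = sum((t - mean_total) ** 2 for t in totals) / n

# KR-20: (k / (k - 1)) * (1 - sum of item p*q / total score variance).
p = [sum(r[i] for r in rows) / n for i in range(k)]
sum_pq = sum(pi * (1 - pi) for pi in p)
kr20 = (k / (k - 1)) * (1 - sum_pq / var_total)

# Classical standard error of measurement.
sem = sqrt(var_total) * sqrt(1 - kr20)


def classify(score, cut=2.5, width=1.0):
    """Pass, fail, or conditional, using a band of width * SEM on each
    side of the passing standard (cut and width are assumed values)."""
    if score > cut + width * sem:
        return "pass"
    if score < cut - width * sem:
        return "fail"
    return "conditional"


decisions = [classify(t) for t in totals]
```

Examinees whose scores fall inside the band receive a conditional status
rather than a firm pass or fail, which is the practical use of the standard
error described in this section.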

Other procedures have been recommended as alternatives. One of these
is a straightforward item sampling approach where the binomial distribution
is employed (Millman, 1974b). Traditional and item sampling approaches were
compared in one study (Haladyna, Note 10) with a slight superiority for the
traditional approach. However, both approaches were found to be lacking in
terms of feasibility with student populations. Bayesian techniques and a
traditional approach were compared in a Monte Carlo study by Hambleton,
Hutten, and Swaminathan (unpublished). While one Bayesian technique showed
a superiority to others in the accuracy of decisionmaking, the differences
were slight.
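
The item sampling idea can be sketched with the binomial distribution: if an
examinee can answer a fixed proportion of the domain and the test is a
random sample of items, the number correct is approximately binomial. The
function below is a generic illustration under that assumption. The true
proportion, test length, and passing score are hypothetical, and this is not
claimed to be the exact procedure of the sources cited.

```python
from math import comb


def prob_at_or_above(cut, n_items, true_proportion):
    """Binomial probability of scoring at least `cut` items correct on a
    test of `n_items` randomly sampled items."""
    return sum(
        comb(n_items, x)
        * true_proportion ** x
        * (1 - true_proportion) ** (n_items - x)
        for x in range(cut, n_items + 1)
    )


# Hypothetical examinee who can answer 70 percent of the domain, tested
# with 10 items against a passing score of 8 correct.
p_pass = prob_at_or_above(8, 10, 0.70)

# If 80 percent of the domain were the intended standard, this is the
# chance that the examinee is passed despite falling short of it.
false_pass_risk = p_pass
```

Under these assumptions the chance of reaching the passing score is roughly
0.38, which shows how large the misclassification risk can be on a short
test.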

One example of the usefulness of traditional reliability estimates for
CR tests can be found in the statewide assessment of fourth grade
mathematics in the state of Oregon (Haladyna, 1976). Following procedures
similar to those recommended by Millman (1974), a content panel, represented
by mathematics educators, was established to judge the congruence of items
with over twenty instructional objectives. All data were subjected to
traditional item and test analysis. Scales for objectives were classified
into five achievement domains and were quite reliable as judged by
traditional KR-20 estimates. Variance was not restricted, and test results
appeared quite similar to those one might obtain in any mathematics
achievement test of fourth graders. Interestingly, this test is CR by virtue
of the way it was constructed, and NR in the loosest sense, and yet
traditional statistics were usefully employed to gain an understanding of
how much measurement error occurred with the fourth grade sample.

There are three salient observations which follow from this discussion
of reliability, measurement error, and decisionmaking:

1. Traditional reliability estimates have been usefully employed in
CR and DR tests to estimate standard errors. If one can judge by this
limited number of studies, the traditional approach is slightly superior to
the item sampling approach and slightly inferior to the Bayesian approach.
What is conclusive from these empirical findings is that all three approaches
lead to measures of the same construct, test error. And all approaches
(e.g., Bayesian, Rasch, traditional, and item sampling) are based on the
concept of true scores.

2. Any test may be used to make decisions and can, therefore, be CR
by simply establishing a decision point and using it accordingly.
Therefore, use of passing standards does not distinguish a CR or DR test
from a NR test.

3. Regardless of the ways tests are categorized (i.e., CR, NR, or DR),
it seems clear that reliability estimates and standard errors are very
comparable regardless of the procedure used to obtain these estimates. If
the CR, DR, and NR tests are truly distinct measurement constructs, NR
approaches to measurement error, reliability, and decisionmaking should be
ineffective. But clearly they are as effective as any other approach,
including those specifically created for these CR and DR tests.

Validity 

A CR test is constructed to reveal an examinee's relationship to a
behavioral repertory (Glaser, 1963) or to measure an examinee's standing
with respect to a criterion level (Popham and Husek, 1969). If NR, CR, and
DR tests are indeed different, then how might they differ in terms of
validity? Are conventional concepts of validity applicable to DR and CR
tests?

One of the unique features of DR tests is the strict adherence to the
item-writing algorithm and the random sampling of items to test forms.
While these item-writing procedures have reached an operational level,
they are by no means unfamiliar. As noted earlier, the need to clearly
define domains, to construct items representing the domain, and to randomly
sample test forms have been hallmarks of classical test theory. The random
sampling of items to test forms is actually one very desirable form of
content validity called "sampling validity" by Helmstadter (1964). What
is unfortunate is the lack of attention given to the principles espoused in
classical test theory. Seldom have achievement tests in the past been
carefully constructed to represent domains, with items randomly assigned to
test forms. Nunnally (1967) admits to the fact that traditional theory is
not true to life. (Or perhaps test practitioners are not true to theory.)

A DR approach to item analysis has been the use of content panels to
judge the item-domain correspondence (Millman, 1974; Hambleton, et al.,
Note 7). The use of such content experts is actually a type of face
validity, the weakest and least justifiable form of content validity
according to Helmstadter (1964).

The role of variance in CR and DR tests had led to the conclusion that
conventional correlation-based statistics are typically useless. As a
consequence, the role of predictive validity was said to be quite limited.
If one considers the potential of using pre- and post-instructed students
in studies of predictive validity, there is no restriction of test scores
and predictive validity may be usefully employed. For example, successful
students should have a high probability of success in future units of
instruction or on a task from which instruction was designed, and unsuccessful
students should not have as good a chance.
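
The restriction-of-range point can be illustrated with a small simulation. The sketch below is hypothetical throughout: the score model, the group means, the sample sizes, and the names `simulate_group` and `pearson_r` are assumptions made for illustration and are not drawn from any study cited in this paper. It compares the correlation between a test score and a later outcome for post-instruction students alone and for pre- and post-instruction students pooled.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_group(n, mean_true, rng, n_items=20):
    """Return (test_scores, later_outcomes) for n hypothetical students."""
    scores, outcomes = [], []
    for _ in range(n):
        # Assumed model: each student has a true probability of a correct answer.
        p = min(max(rng.gauss(mean_true, 0.08), 0.02), 0.98)
        # Test score: number of items answered correctly out of n_items.
        scores.append(sum(rng.random() < p for _ in range(n_items)))
        # Assumed later outcome: true proficiency plus unrelated noise.
        outcomes.append(100 * p + rng.gauss(0, 10))
    return scores, outcomes


rng = random.Random(1976)
pre_x, pre_y = simulate_group(200, 0.30, rng)    # before instruction
post_x, post_y = simulate_group(200, 0.85, rng)  # after instruction

var_post = statistics.pvariance(post_x)
var_pooled = statistics.pvariance(pre_x + post_x)
r_post_only = pearson_r(post_x, post_y)
r_pooled = pearson_r(pre_x + post_x, pre_y + post_y)

print(f"score variance, post only: {var_post:.2f}")
print(f"score variance, pooled:    {var_pooled:.2f}")
print(f"validity r, post only:     {r_post_only:.2f}")
print(f"validity r, pooled:        {r_pooled:.2f}")
```

Under these assumptions the pooled group shows much larger score variance and a larger validity coefficient than the post-instruction group alone. Part of the pooled correlation reflects the difference between instructed and uninstructed students, which is the sense in which the argument here treats the restriction as a property of the sample rather than of the test.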

There is a more compelling reason to study predictive validity in the
context of systematic instruction. Where the passing standard is set
determines who will pass and who will fail. Instruction is planned to establish
high degrees of achievement in students for the purposes of either continuing
in a sequence of instructional units or giving evidence of competence so that
examinees may perform a task or series of tasks. Setting high standards will
ensure higher levels of achievement, with the risk of obtaining greater
failures and more frustration on the part of students. Where the ideal criterion
level is and how to maximize the success of students in future endeavors are
problems of predictive validity. However, the establishment of a criterion
level is not a distinguishing trait of a CR or DR test. In the truest sense
of the word, it is a CR use of a test.

The need for more concern for the construct validity of educational
achievement tests was expressed by Messick (1975, p. 957):






...all measurement should be construct referenced. A measure estimates
how much of something an individual displays or possesses. The basic
question is, What is the nature of that something? It may be answered
by referring to evidence in support of particular attributes, processes
or traits constructed to underlie and determine task performance.

While it is an eloquent plea for greater concern for the inferred construct
behind each achievement measure, the opposing approach, epitomized in DR
testing, is equally compelling. The essence of the disagreement is based
in our willingness to accept interpretations of achievement in strict
behavioral language. Or, as Cronbach and Meehl (1955) contend, when our
operational definitions conflict, one is compelled to become concerned with
the construct validity of our test interpretations. Borrowing an example
from Millman (1974, p. 321), a DRT can be constructed to ascertain if a
student can solve profit and loss word problems. While a domain can be
defined algorithmically, is this sufficient to define the domain of
mathematics achievement in which we are interested? The differences here are in
the realm of a philosophy of scientific inquiry and well beyond what is
intended here. Regardless of one's stance on interpretation, it is clear
that traditional concepts of validity work well in the context of
systematic instruction where achievement tests are geared to the learner.

Conclusion 

Despite the many efforts to construct a theory of CR measurement, there
has been understandably little progress. Perhaps the raison d'être is that
there are really not two or three different measurement constructs, but only
one. That one construct has two primary functions: (a) the first is
knowing how much of that trait an examinee possesses (CR); and (b) the second is
knowing how different one examinee is from another (NR).




The contention that CR and NR tests are distinguished by the way
items are constructed (i.e., item-writing algorithms) has not been empirically
supported. In fact, DR tests look and behave like any other test of
the same domain when administered to an equivalent or same group of examinees.

The belief that variance is greatly reduced in CR tests when compared
to NR tests is also quite unsupported. What does occur in effective
instruction happens to any test which is geared to the content being taught. Thus,
the restriction in range of achievement test scores of the post-instruction
students is a function of the instruction and not the test. And, it has
been demonstrated that variance is actually maximized in situations where
the tests are directly geared to the instruction that occurs.

The role of variance and the variance-reliant statistics has also been
questioned by many CR advocates. With variance not restricted as originally
believed, traditional reliability and item discrimination indexes can be
usefully estimated. When they are computed, they are found to be quite
comparable to statistics uniquely compatible to CR and DR tests, thus giving
credibility to the argument that a host of reliability and item
discrimination procedures lead to measures of the same constructs, measurement error,
and item quality.
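
The comparability claimed above can be sketched numerically. The example below is a hypothetical illustration on simulated data, not a reanalysis of any study cited here. It computes one traditional reliability estimate (Kuder-Richardson formula 20), one traditional discrimination index (the item-total correlation), and, as a stand-in for the indexes proposed for CR tests, a simple pre-to-post difference in item difficulty. The response model, the item gains, and all function names are assumptions.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def kr20(responses):
    """Kuder-Richardson formula 20 for a list of 0/1 response vectors."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    var_total = statistics.pvariance(totals)
    sum_pq = 0.0
    for j in range(k):
        p = statistics.fmean([row[j] for row in responses])
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


def simulate(n, gains, rng, instructed, base=0.25):
    """Simulate 0/1 responses; instruction raises item j by gains[j] (assumed)."""
    rows = []
    for _ in range(n):
        offset = rng.gauss(0, 0.10)  # assumed individual difference
        row = []
        for g in gains:
            p = base + offset + (g if instructed else 0.0)
            p = min(max(p, 0.02), 0.98)
            row.append(1 if rng.random() < p else 0)
        rows.append(row)
    return rows


rng = random.Random(20)
n_items = 10
# Hypothetical items ranging from insensitive to highly sensitive to instruction.
gains = [0.05 + 0.65 * j / (n_items - 1) for j in range(n_items)]
pre = simulate(150, gains, rng, instructed=False)
post = simulate(150, gains, rng, instructed=True)
pooled = pre + post
totals = [sum(row) for row in pooled]

# Traditional index: correlation of each item with the total score.
item_total_r = [pearson_r([row[j] for row in pooled], totals)
                for j in range(n_items)]
# Difference index: proportion correct after instruction minus before.
diff_index = [statistics.fmean([row[j] for row in post])
              - statistics.fmean([row[j] for row in pre])
              for j in range(n_items)]

reliability = kr20(pooled)
agreement = pearson_r(item_total_r, diff_index)
print(f"KR-20 on pooled group: {reliability:.2f}")
print(f"correlation between the two item indexes: {agreement:.2f}")
```

In this sketch the two item indexes order the items in much the same way when pre- and post-instruction examinees are pooled, which is the pattern the paragraph above describes. The result depends on the assumed model and should not be read as evidence beyond that.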

In effect, any achievement measure is simply that. It is neither NR,
CR, nor DR. The advent of the CR test, and later the DR test, may be a
reaction to what traditional test theory has evolved into, a degenerate use.
That is, classroom teachers are unable to cope with the intricacies of test
theory and the demands to construct and analyze classroom achievement tests
in the recommended way. This has led to testing practices which are
actually reproachable and have come to be labeled "NR". It is undeniably
clear that classroom achievement tests have in the past been, and will in the
future be, misused. The movement toward CR and DR testing has created an
interest in unifying instruction and testing. For the most part, this
creates testing which is well suited to the needs of effective evaluation
of instruction and student progress. It does not constitute a new form
of measurement, as the arguments presented here and accumulating test data
have attested and will continue to attest.



REFERENCE NOTES



1. [...] item [...] techniques. A paper presented at the annual meeting of the
American Educational Research Association, San Francisco, 1976.

2. Cox, R. C., & Vargas, J. A comparison of item selection techniques for
norm-referenced and criterion-referenced tests. Paper presented at the
annual meeting of the American Educational Research Association, 1966.

3. Rahmlow, H. F., Matthews, J. J., & Jung, S. An empirical investigation
of item analysis in criterion-referenced tests. A paper presented at the
annual meeting of the American Educational Research Association, Minneapolis,
1970.

4. Hsu, T. Empirical data on criterion-referenced tests. A paper presented
at the annual meeting of the American Educational Research Association,
New York, 1971.

5. Helmstadter, G. C. A comparison of Bayesian and traditional indexes of test
item effectiveness. A paper presented at the annual meeting of the National
Council on Measurement in Education, Chicago, 1974.

6. Haladyna, T. M., & Roid, G. The quality of domain-referenced test items.
A paper presented at the annual meeting of the American Educational Research
Association, San Francisco, 1976.

7. Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D. Criterion-
referenced testing and measurement: A review of technical issues and
developments. A symposium presented at the annual meeting of the American
Educational Research Association, Washington, D.C., 1975.

8. Kosecoff, J. B., & Klein, S. P. Instructional sensitivity statistics
appropriate for objective-based test items. A paper presented at the
annual meeting of the National Council on Measurement in Education, Chicago,
1974.

9. Harris, C. W. Techniques for analyzing test response data. A paper
presented at the annual meeting of the American Educational Research
Association, Washington, D.C., 1975.

10. Haladyna, T. M. An investigation of full and subscale reliabilities of
criterion-referenced tests. A paper presented at the annual meeting of
the American Educational Research Association, Chicago, 1974.



REFERENCES



Anderson, R. C. How to construct achievement tests to assess comprehension.
Review of Educational Research, 1972, 42, 145-170.

Bormuth, J. R. On the theory of achievement test items. Chicago: University
of Chicago Press, 1970.

Carver, R. P. Two dimensions of tests: Psychometric and edumetric. American
Psychologist, 1974, 29, 512-518.

Cronbach, L. J. Review of "On the theory of achievement test items."
Psychometrika, 1970, 35, 509-511.

Cronbach, L. J. Dissent from Carver. American Psychologist, 1975, 30, 602-605.

Cronbach, L. J., & Meehl, P. E. Construct validity in psychological tests.
Psychological Bulletin, 1955, 52, 281-302.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. The dependability
of behavioral measurements. New York: John Wiley, 1972.

Ebel, R. Evaluation and educational objectives. Journal of Educational
Measurement, 1973, 10, 273-279.

Glaser, R. Instructional technology and the measurement of learning outcomes:
Some questions. American Psychologist, 1963, 18, 519-521.

Glaser, R., & Klaus, D. Proficiency measurement: Assessing human performance,
pp. 419-474. In R. M. Gagne (Ed.), Psychological principles in system
development. New York: Holt, Rinehart & Winston, 1962.

Haladyna, T. M. Effects of different samples on item and test characteristics
of criterion-referenced tests. Journal of Educational Measurement, 1974,
11, 93-99.

Haladyna, T. M. Technical report of the pilot assessment of mathematics in
grade 4, Oregon. Monmouth, OR: Teaching Research, 1976.

Hambleton, R. K., & Novick, M. R. Toward an integration of theory and method
for criterion-referenced tests. Journal of Educational Measurement, 1973,
10, 159-170.

Hambleton, R. K., Hutten, L. R., & Swaminathan, H. A comparison of several
methods for assessing student mastery in objective-based instructional
programs. Unpublished.

Helmstadter, G. C. Principles of psychological measurement. New York:
Appleton-Century-Crofts, 1964.

Hively, W. Introduction to domain-referenced testing. Educational Technology,
1974, 14, 5-10.




Kaplan, A. The conduct of inquiry. San Francisco: Chandler, 1964.



Lord, F. M., & Novick, M. R. Statistical theories of mental test scores.
Reading, Mass.: Addison-Wesley, 1968.

Millman, J. Criterion-referenced measurement. In W. J. Popham (Ed.),
Evaluation in education: Current applications. San Francisco: McCutchan,
1974.

Messick, S. The standard problem: Meaning and values in measurement and
evaluation. American Psychologist, 1975, 30, 955-966.

Millman, J., & Popham, W. J. The issue of item and test variance for
criterion-referenced tests: A clarification. Journal of Educational
Measurement, 1974, 11, 137-138.

Nunnally, J. C. Psychometric theory. New York: McGraw-Hill, 1967.

Popham, W. J. Indices of adequacy for criterion-referenced tests. In W. J.
Popham (Ed.), Criterion-referenced measurement. Englewood Cliffs, NJ:
Educational Technology Publications, 1972.

Popham, W. J. Selecting objectives and generating test items for objective-
based tests. In C. W. Harris, M. C. Alkin, & W. J. Popham (Eds.),
Problems in criterion-referenced measurement. CSE monograph series in
evaluation, No. 3. Los Angeles: CSE, 1974.

Popham, W. J., & Husek, T. R. Implications of criterion-referenced measurement.
Journal of Educational Measurement, 1969, 6, 1-9.

Thorndike, R. M. The analysis and selection of test items. In D. N. Jackson &
S. Messick (Eds.), Problems in human assessment. New York: McGraw-Hill,
1967.

Woodson, M. I. C. E. The issue of item and test variance for criterion-
referenced tests. Journal of Educational Measurement, 1974, 11, 63-64, a.

Woodson, M. I. C. E. The issue of item and test variance for criterion-
referenced tests: A reply. Journal of Educational Measurement, 1974, 11,
139-140, b.



